The following tutorial is the final assessment of the project seminar “Species Distribution Modeling” at Philipps-University Marburg. In this tutorial we are going to use the XGBoost algorithm to predict the distribution of butterfly species in Pakistan and create a species richness map of the country. XGBoost (eXtreme Gradient Boosting) is a popular machine learning algorithm that belongs to the family of gradient boosting methods. It was developed by Tianqi Chen and combines gradient boosting, decision trees, regularization, gradient-based optimization, feature importance analysis, and parallelization. All of this makes it a robust and powerful algorithm that often delivers state-of-the-art results in various machine learning tasks. You will be introduced to the basic concepts of XGBoost and be provided with a reproducible workflow for building classification models with it.
XGBoost is an ensemble method like Random Forest, meaning it combines the output of multiple trees. The two methods differ, however, in how the individual trees are built and how their results are combined. In XGBoost the outputs of the trees are not combined equally. Instead, XGBoost uses a method called boosting: weak learners (small trees) are combined sequentially, so that each new tree corrects the errors of the previous ones. To understand this we have to look into some mathematical details. Don’t worry, when using XGBoost these details are automated; nevertheless it is important to understand these processes in order to optimize the algorithm later on.
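The boosting idea described above can be sketched in a few lines of base R. This is only an illustration, not how XGBoost is implemented: the weak learner here is a one-split “stump” fitted by least squares, whereas XGBoost grows deeper trees from gradients and Hessians. All names and numbers below are made up for the demo.

```r
# Toy boosting sketch: each weak learner (a one-split "stump") is fitted
# to the residuals of the ensemble built so far.
fit_stump <- function(x, y) {
  best <- list(sse = Inf, s = NA, left = NA, right = NA)
  for (s in head(sort(unique(x)), -1)) {        # candidate split points
    left  <- mean(y[x <= s])
    right <- mean(y[x >  s])
    pred  <- ifelse(x <= s, left, right)
    sse   <- sum((y - pred)^2)
    if (sse < best$sse) best <- list(sse = sse, s = s, left = left, right = right)
  }
  best
}

set.seed(1)
x <- seq(0, 10, by = 0.1)
y <- sin(x) + rnorm(length(x), sd = 0.1)

pred <- rep(mean(y), length(y))                 # ensemble starts from a constant
eta  <- 0.3                                     # learning rate shrinks each tree's contribution
for (i in 1:50) {
  res  <- y - pred                              # residuals: errors of the current ensemble
  st   <- fit_stump(x, res)                     # next weak learner fits the residuals
  pred <- pred + eta * ifelse(x <= st$s, st$left, st$right)
}
mse_boosted  <- mean((y - pred)^2)
mse_constant <- mean((y - mean(y))^2)           # boosting reduces this baseline error
```

Each round provably lowers the training error, which is also why too many rounds eventually overfit; we will see the same effect later with `nrounds`.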
As mentioned, XGBoost builds on many concepts to deliver its often outstanding results. We are going to start with the mathematical base concepts of how XGBoost builds trees.
In this assessment we are trying to classify geo-points as potential habitats for a number of butterfly species. In order to do that we need XGBoost to build classification trees. XGBoost also works for regression, but the process differs slightly, so we are not going to focus on that.
How XGBoost builds trees is limited by multiple regularization parameters:
We encountered lambda when we calculated the similarity score. In the example calculations we set lambda to 0, which is why we could ignore it. But when lambda is set to a value > 0, the similarity score gets smaller because the denominator becomes larger. Thus lambda helps prevent over-fitting.
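As a hedged sketch of this effect, assuming the similarity score formula used in the worked example, (Σ residuals)² / (Σ p(1−p) + λ) for a classification leaf, with hypothetical residuals and probabilities:

```r
# Similarity score of a classification leaf (sketch; numbers are hypothetical):
# (sum of residuals)^2 / (sum of prev_prob * (1 - prev_prob) + lambda)
similarity <- function(residuals, prev_prob, lambda = 0) {
  sum(residuals)^2 / (sum(prev_prob * (1 - prev_prob)) + lambda)
}

res  <- c(0.5, 0.5, -0.5)   # residuals in the leaf
prob <- c(0.5, 0.5, 0.5)    # previous predicted probabilities
s0 <- similarity(res, prob, lambda = 0)  # 0.25 / 0.75
s1 <- similarity(res, prob, lambda = 1)  # larger denominator, smaller score
```

Raising lambda always shrinks the score, which in turn shrinks the gain of candidate splits and makes pruning more aggressive.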
Another regularization parameter is the cover, or min_child_weight. This parameter is also the reason why we have not continued building our example tree. In XGBoost the default value for the cover is 1, which means that every leaf with a cover value of less than 1 does not get built. For classification trees the cover is calculated by summing, for each residual in the leaf, the previous probability times one minus the previous probability.
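This check can be written out in two lines, again with hypothetical probabilities:

```r
# Cover (min_child_weight) of a classification leaf:
# sum over the leaf of previous_probability * (1 - previous_probability)
cover <- function(prev_prob) sum(prev_prob * (1 - prev_prob))

cover(c(0.5, 0.5))            # 0.5  -> below the default min_child_weight of 1, leaf is pruned
cover(c(0.5, 0.5, 0.5, 0.5, 0.5))  # 1.25 -> leaf is allowed
```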
Similar to cover (min_child_weight), gamma is a regularization parameter that causes XGBoost to only keep new leaves when the gain value is larger than gamma.
It therefore prevents overfitting the trees to our data. Gamma is a highly data-dependent regularization parameter, which means that there is no universally “good” value. By default it is 0, so no regularization takes place.
XGBoost subtracts gamma from the gain value and removes the leaf if the result is negative. For example, taking the previously calculated gain value of 3.8 from our example tree, a gamma value of 4 would prune the whole tree down to the root. But if gamma is just 0, XGBoost can build extremely large trees, overfitting them to the dataset and raising the computation time considerably.
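The pruning rule itself is just a subtraction, shown here with the numbers from the example above:

```r
# XGBoost's pruning rule: a branch is kept only if gain - gamma is positive
keep_branch <- function(gain, gamma) (gain - gamma) > 0

keep_branch(gain = 3.8, gamma = 0)  # TRUE: no regularization, branch is kept
keep_branch(gain = 3.8, gamma = 4)  # FALSE: pruned, as in the example above
```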
XGBoost is available for different programming and scripting languages, including Python, R, Java and C++. Documentation is available here: https://xgboost.readthedocs.io/en/stable/
Before you dive into the code you need to install some packages this script will use:
install.packages("terra")
install.packages("ggplot2")
install.packages("fastDummies")
install.packages("tidyterra")
The XGBoost package can be installed in two different ways: * First, there is the default package from CRAN, which will do in most situations.
install.packages("xgboost")
TODO: For benchmarks of a high-end CPU vs. a low-end GPU, see https://medium.com/data-design/xgboost-gpu-performance-on-low-end-gpu-vs-high-end-cpu-a7bc5fcd425b. Risks? Reproducibility?
##
## !! Installation on windows failed with "Warning in system("sh ./configure.win") 'sh' not found", for dirty Solution see section Troubleshooting
##
# Install dependencies
install.packages(c("data.table", "jsonlite"))
# Install XGBoost
system(paste("R CMD INSTALL ", getwd(), "/xgboost_r_gpu_win64_21d95f3d8f23873a76f8afaad0fee5fa3e00eafe.tar.gz", sep=""))
Don’t forget to enable GPU acceleration in the knitting parameters.
After installing all needed libraries you need to load them:
require(dplyr) # easy dataframe modification
require(ggplot2) # plotting
require(geodata) # downloading geospatial world dataset made easy
require(sf) # simple geospatial features
require(terra) # raster manipulation
require(tidyterra) # plot terra objects with ggplot
require(fastDummies)# create binary factor columns from character column
require(xgboost) # our modeling libary
set.seed(101) # set.seed() expects a number, not a string
Let’s begin by preparing the data used to train the model. Start by getting an overview of the provided data:
species_occurrences_all <- read.table("data/PakistanLadakh.csv", sep=",", header=TRUE)
species_occurrences_all <- sf::st_as_sf(species_occurrences_all, coords=c("x", "y"), remove=TRUE, crs=sf::st_crs("epsg:4326"))
str(species_occurrences_all)
## Classes 'sf' and 'data.frame': 5870 obs. of 2 variables:
## $ species : chr "Acraea_issoria" "Acraea_issoria" "Acraea_issoria" "Acraea_issoria" ...
## $ geometry:sfc_POINT of length 5870; first list element: 'XY' num 73.1 34.4
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA
## ..- attr(*, "names")= chr "species"
ggplot() +
geom_sf(data = sf::st_as_sf(species_occurrences_all), mapping=aes(color=species), show.legend = FALSE) +
ggtitle("Observed occurrence of butterflies in Pakistan")
Next we need some environmental data to train the model. We selected the bioclim data, which is widely used in species distribution modeling (source: https://isprs-archives.copernicus.org/articles/XLII-4-W19/449/2019/isprs-archives-XLII-4-W19-449-2019.pdf). The bioclim layers do not include elevation data, so we will add elevation as well, since temperature depends on elevation. We also need the border of Pakistan: we will crop our data with it, so the model does not train on areas where we have no presence points. The elevation data is already cropped, so we don’t need to repeat this step.
Read more here https://www.worldclim.com/bioclim
# political border of pakistan
border_pak <- geodata::gadm(country='PAK', level = 0, path='./data')
ggplot() +
geom_sf(data = sf::st_as_sf(border_pak), fill=NA)
# bioclim data from pakistan
bioclim_pak <- geodata::worldclim_country(country = "PAK", res = 10, var = "bio", path = "data/", version = "2.1")
names(bioclim_pak) <- substr(names(bioclim_pak), 11, 20)
nam <- c("Annual Mean Temperature",
"Mean Diurnal Range",
"Isothermality",
"Temperature Seasonality",
"Max Temperature of Warmest Month",
"Min Temperature of Coldest Month",
"Temperature Annual Range",
"Mean Temperature of Wettest Quarter",
"Mean Temperature of Driest Quarter",
"Mean Temperature of Warmest Quarter",
"Mean Temperature of Coldest Quarter",
"Annual Precipitation",
"Precipitation of Wettest Month",
"Precipitation of Driest Month",
"Precipitation Seasonality",
"Precipitation of Wettest Quarter",
"Precipitation of Driest Quarter",
"Precipitation of Warmest Quarter",
"Precipitation of Coldest Quarter"
)
bioclim_pak <- terra::mask(bioclim_pak, border_pak)
plot(x=bioclim_pak)
# elevation data form pakistan
elevation_pak <- geodata::elevation_30s(country = 'PAK', path = 'data/')
ggplot() +
geom_spatraster(data = elevation_pak) +
geom_sf(data = sf::st_as_sf(border_pak), fill = NA, show.legend = FALSE) +
scale_fill_hypso_c(name = "Elevation")
Let’s define our species occurrences as presence points:
species_presence <- species_occurrences_all
# adding col presence to determine presence for model training
species_presence <- cbind(species_presence, data.frame(presence = factor(1)))
rm(species_occurrences_all)
As absence points we will use random points in Pakistan and combine them with the presence points. The absence points are extended with a column “species” that matches the column “species” in the presence points.
# Generate random points inside pakistan as background points and extend them with a column for species = NA
# TODO: why 1000 points? Give an explanation for this decision
border_pak <- sf::st_as_sf(border_pak)
species_absence <- sf::st_sample(border_pak, size = 1000)
# adding col species = NA to the background points, needed for rbind to join the data
# adding col presence to determine absence for model training
species_absence <- cbind(species_absence, data.frame(species = as.character(NA)))
species_absence <- cbind(species_absence, data.frame(presence = factor(0)))
species_absence <- sf::st_as_sf(species_absence)
# Combine presence and absence (background) points into a single object
modeling_data <- rbind(species_presence, species_absence)
# Only points inside Pakistan should be used for modeling, also remove the columns added by the intersection.
modeling_data <- sf::st_intersection(modeling_data, border_pak) %>% select(-COUNTRY, -GID_0)
Now we have our presence and absence points as spatial data. Finally we extract values from our environmental data and add them to our modeling data, so XGBoost can use this table to train its model.
# Extract values from bioclim and elevation, join them to our modeling_data
extraction_bioclim_pak <- terra::extract(bioclim_pak, modeling_data, bind=FALSE, ID=FALSE)
extraction_elevation_pak <- terra::extract(elevation_pak, modeling_data, bind=FALSE, ID=FALSE)
modeling_data_extracted <- cbind( modeling_data, extraction_bioclim_pak, extraction_elevation_pak)
Clean up of no longer needed variables and check the final modeling data:
# create a final data variable and clean up variables
modeling_data <- modeling_data_extracted
rm(species_presence); rm(species_absence); rm(extraction_bioclim_pak); rm(extraction_elevation_pak); rm(modeling_data_extracted)
str(modeling_data)
ggplot() +
geom_sf(data = sf::st_as_sf(border_pak), fill=NA, show.legend=FALSE) +
geom_sf(data = sf::st_as_sf(modeling_data), mapping=aes(color=species), show.legend = FALSE) +
ggtitle("Observed occurrence of butterflies in Pakistan plus background points")
## Classes 'sf' and 'data.frame': 1397 obs. of 23 variables:
## $ species : chr "Acraea_issoria" "Acraea_issoria" "Acraea_issoria" "Acraea_violae" ...
## $ presence : Factor w/ 2 levels "1","0": 1 1 1 1 1 1 1 1 1 1 ...
## $ bio_1 : num 17.7 12.2 15.4 24.2 13.9 ...
## $ bio_2 : num 12.1 10.4 11.3 14.5 11.6 ...
## $ bio_3 : num 37.9 36.5 37.7 38.3 38 ...
## $ bio_4 : num 719 682 682 833 712 ...
## $ bio_5 : num 33.8 26.1 30.1 41.7 28.8 ...
## $ bio_6 : num 1.8 -2.5 0 3.9 -1.7 ...
## $ bio_7 : num 32 28.6 30.1 37.8 30.5 ...
## $ bio_8 : num 24.7 19.2 22 31.7 21.2 ...
## $ bio_9 : num 13.98 9.37 12.07 19.13 10.82 ...
## $ bio_10 : num 26 20.1 23.2 33.1 22.2 ...
## $ bio_11 : num 8.3 3.37 6.37 13.07 4.53 ...
## $ bio_12 : num 1358 1376 1005 312 1130 ...
## $ bio_13 : num 275 239 169 85 186 190 124 86 181 258 ...
## $ bio_14 : num 28 26 22 3 23 23 22 15 25 17 ...
## $ bio_15 : num 68.9 57.8 55.9 99.8 54 ...
## $ bio_16 : num 628 585 413 190 448 460 310 224 423 593 ...
## $ bio_17 : num 119 133 99 16 113 103 87 59 107 81 ...
## $ bio_18 : num 619 572 406 187 441 451 146 91 421 563 ...
## $ bio_19 : num 251 259 195 38 220 205 174 133 208 187 ...
## $ PAK_elv_msk: num 1219 2315 1581 166 1980 ...
## $ geometry :sfc_POINT of length 1397; first list element: 'XY' num 73.1 34.4
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "names")= chr [1:22] "species" "presence" "bio_1" "bio_2" ...
species_nsamples = data.frame(modeling_data) %>%
count(species, sort=TRUE) %>%
rename(n_samples = n) %>%
filter(!is.na(species))
ggplot(species_nsamples, aes(n_samples)) +
geom_histogram(binwidth = 4) +
geom_vline(aes(xintercept=mean(n_samples)), linetype="dashed") +
annotate(x=mean(species_nsamples$n_samples), y=+Inf, label=paste("Mean:",round(mean(species_nsamples$n_samples),2)), vjust=3, geom="label") +
labs(x = "Number of Samples", y = "Number of Species")
rm(species_nsamples)
TODO: it would be better to buffer the presence points and remove absence points that fall close to them
To train our XGBoost models we decided to do separate training for every species, even though XGBoost is capable of multi-class classification. We therefore defined a function ‘train’ which performs our data filtering and the XGBoost-specific data preparation, especially converting the modeling data into an ‘xgb.DMatrix’ object. We also define some general parameters we want to use, e.g. enabling GPU acceleration via the knit parameters. Last but not least we save the model to disk to preserve it for prediction.
https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.core DMatrix is an internal data structure that is used by XGBoost, which is optimized for both memory efficiency and training speed. You can construct DMatrix from multiple different sources of data.
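Before using it inside ‘train’, here is a minimal standalone example of constructing a DMatrix from a plain numeric matrix and a 0/1 label vector (the data is random and only for illustration):

```r
library(xgboost)

# Build an xgb.DMatrix from a numeric matrix plus a binary label vector
set.seed(1)
X <- matrix(rnorm(20), nrow = 10, ncol = 2)      # 10 samples, 2 features
y <- sample(0:1, size = 10, replace = TRUE)      # 0/1 labels, one per row
dtrain <- xgb.DMatrix(data = X, label = y)

dim(dtrain)                 # rows and columns as seen by XGBoost: 10 and 2
getinfo(dtrain, "label")    # the labels stored inside the DMatrix
```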
train <- function(
data, # training data, required
nrounds,
xgb_params = list(), # xgboost params, see https://xgboost.readthedocs.io/en/latest/parameter.html
sp, # species name, only required if save_model = TRUE
save_model = TRUE
) {
# xgboost needs training data in a specific data format, which provides memory and speed optimization
data <- xgb.DMatrix(
data = as.matrix(as.data.frame(data) %>% select(-species, -presence, -geometry)),
label = as.matrix(as.data.frame(data) %>% select(presence))
)
# PARAM gpu_acc:
if(params$gpu_acc){ xgb_params = c(xgb_params, tree_method = "gpu_hist") }
else{ xgb_params = c(xgb_params, tree_method = "hist") }
xgb_params = c(xgb_params, objective = "binary:logistic") # logistic regression for binary classification, output probability
xgb_params = c(xgb_params, eval_metric = "auc")
model <- xgboost(data = data,
nrounds = nrounds,
params = xgb_params,
verbose = 0
)
message("avg train-auc:", as.numeric(mean(model[["evaluation_log"]][["train_auc"]])))
if(save_model)
{
xgb.save(model, paste("out/", sp, ".model", sep = "")) # Max compatibility with future xgboost versions
#save(model, file = paste("out/", sp, ".rds", sep = "")) # Fully restorable r object
}
return(model)
}
Additional training parameters are defined here in their own data frame. The advantage of this is the comparability and reusability of the different parameter sets. The table shows the default parameters, one set taken from Roozbeh Valavi et al., and finally the parameters we will use for training. Those have been tested and tuned manually.
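The parameter table itself did not survive conversion into this document. As a minimal reconstruction, the following data frame compares the two sets that can be verified here: the XGBoost defaults and our tuned set as used in the training loop later on (the Valavi et al. column is omitted because its values are not recoverable from this document):

```r
# Reconstruction of the parameter comparison table (Valavi et al. set omitted)
xgb_param_sets <- data.frame(
  parameter = c("eta", "max_depth", "gamma", "alpha", "lambda", "min_child_weight"),
  default   = c(0.3,   6,           0,       0,       1,        1),
  ours      = c(0.15,  8,           1.5,     0,       0.7,      1)
)
xgb_param_sets
```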
Next we prepare the data used to predict species occurrence in Pakistan. We will use the raw raster data and predict on it with ‘terra::predict’, which allows us to pass a ‘SpatRaster’ object. XGBoost can’t handle a SpatRaster directly, so ‘terra::predict’ lets us define a custom prediction function, which converts the data into a matrix.
# gen stack from rasters bioclim_pak and elevation_pak
elev_pak = resample(elevation_pak, bioclim_pak) # align the elevation raster to the bioclim grid
prediction_rstack = c(bioclim_pak, elev_pak)
# Remove values outside pakistan, because otherwise the model will make predictions outside the modeling area
prediction_rstack = terra::mask(prediction_rstack, border_pak)
# We need a custom predict function for terra::predict() since xgboost doesn't take a data.frame as input. See https://stackoverflow.com/questions/71947124/predict-xgboost-model-onto-raster-stack-yields-error
predict_custom <- function(model, data, ...)
{
stats::predict(model, newdata=as.matrix(data), ...)
}
predict <- function(model, # xgboost model as object
prediction_data # prediction data as spatraster stack
)
{
#model = xgb.load(paste("out/", sp, "/" ,sp, ".model", sep = ""))
#model = readRDS(paste("out/", sp, "/" ,sp, ".bin", sep = ""))
prediction = terra::predict(object=prediction_data,
model=model,
fun=predict_custom
)
return(prediction)
}
Now we can train and predict our first species:
sp = "Aglais_caschmirensis"
#TODO
# filter modeling data to current species, don't forget the absence points!
data <- modeling_data %>% filter(species == sp | is.na(species))
model <- train(data, nrounds = 10, save_model = FALSE)
## avg train-auc:0.991101886792453
prediction <- predict(model, prediction_rstack)
ggplot() +
geom_spatraster(data = prediction) +
scale_fill_hypso_c(direction = -1,
limits=c(0,1),
name = "Prediction") +
geom_sf(data = modeling_data %>% filter(species == sp),
size = 1,
shape = 1 ) +
geom_sf(data = sf::st_as_sf(border_pak),
fill = NA, show.legend=FALSE) +
ggtitle(paste("Observed and predicted occurrence of", sp, "in Pakistan"))
## SpatRaster resampled to ncells = 500703
Next we have a look at some metrics: 1. Cross entropy loss (https://www.youtube.com/watch?v=Pwgpl9mKars)
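The cross entropy (log loss) mentioned above is easy to compute by hand for binary classification; this base-R sketch uses hypothetical labels and probabilities:

```r
# Binary cross entropy (log loss): -mean(y*log(p) + (1-y)*log(1-p))
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)   # clip probabilities to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

y <- c(1, 0, 1, 1)
ll_good  <- log_loss(y, c(0.9, 0.1, 0.8, 0.7))  # confident and mostly correct: low loss
ll_coin  <- log_loss(y, rep(0.5, 4))            # uninformative model: log(2)
```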
ggplot(model[["evaluation_log"]]) +
geom_point(aes(x=iter,y=train_auc))
xgb.ggplot.shap.summary(data = as.matrix(as.data.frame(modeling_data) %>% select(-species, -presence, -geometry)), model = model )
rm(sp); rm(data); rm(model); rm(prediction)
Let’s repeat this with improved parameters: nrounds = 100
## avg train-auc:0.999064716981132
## SpatRaster resampled to ncells = 500703
Let’s repeat this with improved parameters: nrounds = 1000
## avg train-auc:0.999906471698113
## SpatRaster resampled to ncells = 500703
Overfitting! Let’s repeat this with improved parameters: nrounds = 1000
TODO: describe parameters
## avg train-auc:0.990339748427673
## SpatRaster resampled to ncells = 500703
if(plot) {
#xgb_model[["evaluation_log"]]
importance <- xgb.importance(model = model)
#xgb.plot.importance(importance_matrix = importance)
#xgb.ggplot.deepness(xgb_model)
#xgb.plot.multi.trees(mode = xgb_model, features_keep = 3)
#library("DiagrammeRsvg", "rsvg")
#gr <- xgb.plot.multi.trees(model=xgb_model, features_keep = 5, render=FALSE)
#DiagrammeR::export_graph(gr, 'tree.pdf', width=600, height=1500)
xgb.ggplot.shap.summary(data = as.matrix(as.data.frame(modeling_data) %>% select(-species, -presence, -geometry)), model = model )
ggsave(paste(sp, "_shap.png", sep=""), path=paste("out/",sp, sep=""))
rm(importance)
}
Finally we combine all predicted species occurrences into a map that
indicates how many species might occur in one pixel. First, we define a
threshold above which a prediction is counted as presence. The
predictions have been saved as tif files in ‘out/’.
chunk_start <- proc.time()
if(exists("l_auc")) rm(l_auc) # reset the result table if a previous run left one behind
l_auc <- data.frame(species = character(),
auc = numeric())
species = (modeling_data %>% distinct(species) %>% filter(!is.na(species)))$species
# xgboost needs the output dir to exist before saving model
dir.create(path="out")
for(sp in species)
{
loop_start <- proc.time()
message("[", match(sp, species), "/", length(species), "] ", sp, ": ", appendLF=F)
data <- modeling_data %>% filter(species == sp | is.na(species))
model <- train(data,
nrounds = 2000,
xgb_params = list(eta = 0.15,
max_depth = 8,
gamma = 1.5,
alpha = 0,
lambda = 0.7,
#tree_method = "gpu_hist",
#predictor="gpu_predictor",
min_child_weight = 1),
save_model = FALSE)
prediction <- predict(model, prediction_rstack)
l_auc[nrow(l_auc) + 1,] = c(sp, as.numeric(mean(model[["evaluation_log"]][["train_auc"]])) )
terra::writeRaster(prediction, paste("out/", sp, ".tif", sep = ""), overwrite=TRUE)
ggplot() +
geom_spatraster(data = prediction) +
scale_fill_hypso_c(direction = -1,
limits=c(0,1),
name = "Prediction") +
geom_sf(data = modeling_data %>% filter(species == sp),
size = 1,
shape = 1 ) +
geom_sf(data = sf::st_as_sf(border_pak),
fill = NA, show.legend=FALSE) +
ggtitle(paste("Observed and predicted occurrence of", sp, "in Pakistan"))
ggsave(paste(sp, ".png", sep=""), path="out")
ggplot(model[["evaluation_log"]]) +
geom_point(aes(x=iter,y=train_auc))
ggsave(paste(sp, "_log.png", sep=""), path="out")
message("\nExecution took ", (proc.time() - loop_start)[3], " seconds")
}
## [1/25] Acraea_issoria: avg train-auc:0.99818205
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.5 seconds
## [2/25] Acraea_violae: avg train-auc:0.986400583333333
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.91 seconds
## [3/25] Actinor_radians: avg train-auc:0.995415083333333
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.05 seconds
## [4/25] Acytolepis_puspa: avg train-auc:0.99999525
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 19.59 seconds
## [5/25] Aeromachus_stigmatus: avg train-auc:0.998671027777778
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 19.63 seconds
## [6/25] Aglais_caschmirensis: avg train-auc:0.995068429245283
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 21.64 seconds
## [7/25] Aglais_rizana: avg train-auc:0.998809380952381
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 22.28 seconds
## [8/25] Anaphaeis_aurota: avg train-auc:0.991001024038462
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 24.86 seconds
## [9/25] Apolria_nabellica: avg train-auc:0.9968332
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.81 seconds
## [10/25] Aporia_belucha: avg train-auc:0.996009045454545
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.83 seconds
## [11/25] Aporia_soracta: avg train-auc:0.999098982142857
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 22.95 seconds
## [12/25] Appias_albina: avg train-auc:0.5
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.36 seconds
## [13/25] Appias_libythea: avg train-auc:0.9799685
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 21.05 seconds
## [14/25] Argynnis_aglaja: avg train-auc:0.997748507575758
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.15 seconds
## [15/25] Argynnis_childreni: avg train-auc:0.997203609375
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 19.96 seconds
## [16/25] Argynnis_jainadeva: avg train-auc:0.99949064516129
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.26 seconds
## [17/25] Argynnis_kamala: avg train-auc:0.999459710526316
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 19.91 seconds
## [18/25] Argynnis_niobe: avg train-auc:0.5
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.05 seconds
## [19/25] Argynnis_pandora: avg train-auc:0.99776975
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 19.37 seconds
## [20/25] Argyreus_hyperbius: avg train-auc:0.992652638157895
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.14 seconds
## [21/25] Arhopala_rama: avg train-auc:0.99853975
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 24.0600000000001 seconds
## [22/25] Arhoplala_dodonaea: avg train-auc:0.997440525
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.8299999999999 seconds
## [23/25] Ariadne_ariadne: avg train-auc:0.5
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.33 seconds
## [24/25] Ariadne_merione: avg train-auc:0.990430384615385
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.48 seconds
## [25/25] Aglais_ladakensis: avg train-auc:0.5
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 19.9 seconds
message("Total Execution took ", (proc.time() - chunk_start)[3], " seconds")
## Total Execution took 520.99 seconds
rm(chunk_start); rm(species); rm(data); rm(model); rm(prediction)
species_nsamples = data.frame(modeling_data) %>%
count(species, sort=TRUE) %>%
rename(n_samples = n) %>%
filter(!is.na(species))
N_auc <- merge(l_auc, species_nsamples, by="species")
N_auc$auc = as.numeric(N_auc$auc)
ggplot(N_auc, aes(x = n_samples, y = auc)) +
geom_point() +
#ylim(0,1) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#geom_smooth(aes(colour = "Interpolation"))
#TODO: overall performance: mean of logloss
# Step 1 generate Raster Stack
# l_species will consist of all 421 species prediction rasters -> RAM usage will be insane ~ 25GB
l_species <- list()
threshold <- 0.5
# reclassify raster:
# value < threshold = 0
# value > threshold = 1
m <- c(0, threshold, 0,
threshold, 1, 1)
m <- matrix(m, ncol=3, byrow=TRUE)
for(sp in (modeling_data %>% distinct(species) %>% filter(!is.na(species)))$species)
{
# get species raster from file system
r <- rast(paste("./out/",sp,".tif", sep = "" ))
l_species[[sp]] <- terra::classify(r, m, include.lowest = TRUE) # [[ ]] avoids the deprecated implicit list embedding of S4 objects
rm(r)
}
stack = terra::rast(l_species)
stack = sum(stack)
stack = terra::mask(stack, border_pak)
ggplot() +
geom_spatraster(data = stack) +
scale_fill_hypso_c(palette="spain", name="N° of species" ) +
#scale_fill_hypso_b(name="N° of species") +
geom_sf(data = sf::st_as_sf(border_pak), fill=NA, show.legend=FALSE) +
ggtitle("Species richness of butterflies in Pakistan")
## SpatRaster resampled to ncells = 500703
ggsave("SpeciesRichnessMap.png", path="out")
## Saving 7 x 5 in image
For how many species is no prediction made? -> count the species whose raster maximum is 0
Installing XGBoost with GPU acceleration
Pre-built binary packages are offered by XGBoost after someone makes
a request on GitHub (https://github.com/dmlc/xgboost/issues/6654). Since the
packages are precompiled, it should not be necessary to compile them
from source. The installation still fails with the error
`Warning in system("sh ./configure.win") 'sh' not found`. So there are two ways to fix this
problem: - Install R-Tools and build from source - Copy src/xgboost.dll
from the archive (https://github.com/dmlc/xgboost/releases) into your R
library manually, e.g. C:%USERNAME%-library\4.2
Session info
sessionInfo()
## R version 4.2.3 (2023-03-15 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] xgboost_1.7.5.1 fastDummies_1.6.3 tidyterra_0.4.0 sf_1.0-12
## [5] geodata_0.5-8 terra_1.7-29 ggplot2_3.4.2 dplyr_1.1.2
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 xfun_0.39 bslib_0.4.2 purrr_1.0.1
## [5] splines_4.2.3 lattice_0.20-45 colorspace_2.1-0 vctrs_0.6.2
## [9] generics_0.1.3 viridisLite_0.4.2 htmltools_0.5.5 mgcv_1.8-42
## [13] s2_1.1.3 yaml_2.3.7 utf8_1.2.3 rlang_1.1.0
## [17] e1071_1.7-13 jquerylib_0.1.4 pillar_1.9.0 glue_1.6.2
## [21] withr_2.5.0 DBI_1.1.3 wk_0.7.3 lifecycle_1.0.3
## [25] munsell_0.5.0 gtable_0.3.3 ragg_1.2.5 codetools_0.2-19
## [29] evaluate_0.21 labeling_0.4.2 knitr_1.42 fastmap_1.1.1
## [33] class_7.3-21 fansi_1.0.4 highr_0.10 Rcpp_1.0.10
## [37] KernSmooth_2.23-20 scales_1.2.1 classInt_0.4-9 cachem_1.0.7
## [41] jsonlite_1.8.5 systemfonts_1.0.4 farver_2.1.1 textshaping_0.3.6
## [45] digest_0.6.31 grid_4.2.3 cli_3.6.1 tools_4.2.3
## [49] magrittr_2.0.3 sass_0.4.6 proxy_0.4-27 tibble_3.2.1
## [53] tidyr_1.3.0 pkgconfig_2.0.3 Matrix_1.5-3 data.table_1.14.8
## [57] rmarkdown_2.21 rstudioapi_0.14 R6_2.5.1 nlme_3.1-162
## [61] units_0.8-1 compiler_4.2.3